Get started with evaluating your model in this R Markdown document, which accompanies the Evaluate your model with resampling article from the tidymodels Get Started series.
If you ever get lost, you can click on the header of the knitted document to see the accompanying section in the online article.
Take advantage of the RStudio IDE and use the “Run All Chunks Above” and “Run Current Chunk” buttons to execute code chunks easily.
Load necessary packages:
library(tidymodels) # for the rsample package, along with the rest of tidymodels
# Helper packages
library(modeldata) # for the cells data
Load cell image data:
data(cells, package = "modeldata")
cells
## # A tibble: 2,019 x 58
## case class angle_ch_1 area_ch_1 avg_inten_ch_1 avg_inten_ch_2 avg_inten_ch_3
## <fct> <fct> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Test PS 143. 185 15.7 4.95 9.55
## 2 Train PS 134. 819 31.9 207. 69.9
## 3 Train WS 107. 431 28.0 116. 63.9
## 4 Train PS 69.2 298 19.5 102. 28.2
## 5 Test PS 2.89 285 24.3 112. 20.5
## # … with 2,014 more rows, and 51 more variables: avg_inten_ch_4 <dbl>,
## # convex_hull_area_ratio_ch_1 <dbl>, convex_hull_perim_ratio_ch_1 <dbl>,
## # diff_inten_density_ch_1 <dbl>, diff_inten_density_ch_3 <dbl>,
## # diff_inten_density_ch_4 <dbl>, entropy_inten_ch_1 <dbl>,
## # entropy_inten_ch_3 <dbl>, entropy_inten_ch_4 <dbl>,
## # eq_circ_diam_ch_1 <dbl>, eq_ellipse_lwr_ch_1 <dbl>,
## # eq_ellipse_oblate_vol_ch_1 <dbl>, eq_ellipse_prolate_vol_ch_1 <dbl>,
## # eq_sphere_area_ch_1 <dbl>, eq_sphere_vol_ch_1 <dbl>,
## # fiber_align_2_ch_3 <dbl>, fiber_align_2_ch_4 <dbl>,
## # fiber_length_ch_1 <dbl>, fiber_width_ch_1 <dbl>, inten_cooc_asm_ch_3 <dbl>,
## # inten_cooc_asm_ch_4 <dbl>, inten_cooc_contrast_ch_3 <dbl>,
## # inten_cooc_contrast_ch_4 <dbl>, inten_cooc_entropy_ch_3 <dbl>,
## # inten_cooc_entropy_ch_4 <dbl>, inten_cooc_max_ch_3 <dbl>,
## # inten_cooc_max_ch_4 <dbl>, kurt_inten_ch_1 <dbl>, kurt_inten_ch_3 <dbl>,
## # kurt_inten_ch_4 <dbl>, length_ch_1 <dbl>, neighbor_avg_dist_ch_1 <dbl>,
## # neighbor_min_dist_ch_1 <dbl>, neighbor_var_dist_ch_1 <dbl>,
## # perim_ch_1 <dbl>, shape_bfr_ch_1 <dbl>, shape_lwr_ch_1 <dbl>,
## # shape_p_2_a_ch_1 <dbl>, skew_inten_ch_1 <dbl>, skew_inten_ch_3 <dbl>,
## # skew_inten_ch_4 <dbl>, spot_fiber_count_ch_3 <int>,
## # spot_fiber_count_ch_4 <dbl>, total_inten_ch_1 <int>,
## # total_inten_ch_2 <dbl>, total_inten_ch_3 <int>, total_inten_ch_4 <int>,
## # var_inten_ch_1 <dbl>, var_inten_ch_3 <dbl>, var_inten_ch_4 <dbl>,
## # width_ch_1 <dbl>
Look at proportion of classes:
cells %>%
  count(class) %>%
  mutate(prop = n/sum(n))
## # A tibble: 2 x 3
## class n prop
## <fct> <int> <dbl>
## 1 PS 1300 0.644
## 2 WS 719 0.356
Define a split object stratified by class column:
set.seed(123)
cell_split <- initial_split(cells %>% select(-case),
                            strata = class)
Apply the split to obtain training (cell_train) and test (cell_test) sets.
cell_train <- training(cell_split)
cell_test <- testing(cell_split)
nrow(cell_train)
## [1] 1515
nrow(cell_train)/nrow(cells)
## [1] 0.7503715
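That roughly 75% figure comes from initial_split()’s default prop = 3/4. As a sketch, the same stratified split could be written with the proportion made explicit (cell_split_explicit is a hypothetical name used here for illustration):

```r
# Equivalent explicit call (a sketch): prop = 3/4 is the default,
# so this reproduces the split proportions used above
set.seed(123)
cell_split_explicit <- initial_split(cells %>% select(-case),
                                     prop = 3/4,
                                     strata = class)
```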
# training set proportions by class
cell_train %>%
  count(class) %>%
  mutate(prop = n/sum(n))
## # A tibble: 2 x 3
## class n prop
## <fct> <int> <dbl>
## 1 PS 975 0.644
## 2 WS 540 0.356
# test set proportions by class
cell_test %>%
  count(class) %>%
  mutate(prop = n/sum(n))
## # A tibble: 2 x 3
## class n prop
## <fct> <int> <dbl>
## 1 PS 325 0.645
## 2 WS 179 0.355
We will work with the training set data for the majority of the modeling steps.
This time we don’t need to preprocess the data as much, so we will not use a recipe. Let’s create the model specification for a random forest, using the ranger engine with the mode set to classification.
rf_mod <-
  rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")
See ?rand_forest for possible engines, modes, and further details.
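For instance, the same specification could be paired with a different engine; a sketch, assuming the randomForest package is installed (rf_mod_alt is a hypothetical name, and its results would differ slightly from ranger’s):

```r
# Same model type, different engine (a sketch; assumes the
# randomForest package is installed)
rf_mod_alt <-
  rand_forest(trees = 1000) %>%
  set_engine("randomForest") %>%
  set_mode("classification")
```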
Now let’s fit the model to our training set using a formula:
set.seed(234)
rf_fit <-
  rf_mod %>%
  fit(class ~ ., data = cell_train)
rf_fit
## parsnip model object
##
## Fit time: 2.5s
## Ranger result
##
## Call:
## ranger::ranger(formula = formula, data = data, num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 1000
## Sample size: 1515
## Number of independent variables: 56
## Mtry: 7
## Target node size: 10
## Variable importance mode: none
## Splitrule: gini
## OOB prediction error (Brier s.): 0.1218873
What happens if we evaluate model performance with the training data?
Use the fitted model to predict on the same training set:
rf_training_pred <-
  predict(rf_fit, cell_train) %>%
  bind_cols(predict(rf_fit, cell_train, type = "prob")) %>%
  # Add the true outcome data back in
  bind_cols(cell_train %>%
              select(class))
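Recent parsnip versions also offer augment() for fitted models, which adds the hard class predictions and the class probabilities in one step; a possible shortcut (rf_training_aug is a hypothetical name, and availability depends on your parsnip version):

```r
# Possible shortcut (assumes a parsnip version where augment() is
# available for fitted models): binds .pred_class and the .pred_*
# probability columns alongside the original data
rf_training_aug <- augment(rf_fit, cell_train)
```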
Look at the performance results:
rf_training_pred %>%  # training set predictions
  roc_auc(truth = class, .pred_PS)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 1.00
rf_training_pred %>%  # training set predictions
  accuracy(truth = class, .pred_class)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.993
Now proceed to the test set:
rf_testing_pred <-
  predict(rf_fit, cell_test) %>%
  bind_cols(predict(rf_fit, cell_test, type = "prob")) %>%
  bind_cols(cell_test %>% select(class))
And look at the performance results from predicting on the test set:
rf_testing_pred %>%  # test set predictions
  roc_auc(truth = class, .pred_PS)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.909
rf_testing_pred %>%  # test set predictions
  accuracy(truth = class, .pred_class)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.837
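The ROC AUC above summarizes the full ROC curve in a single number; the curve itself can be plotted with yardstick’s roc_curve() and autoplot() (a sketch, using the packages already loaded with tidymodels):

```r
# Plot the test-set ROC curve: roc_curve() computes sensitivity and
# specificity across probability thresholds, autoplot() draws the
# curve with ggplot2
rf_testing_pred %>%
  roc_curve(truth = class, .pred_PS) %>%
  autoplot()
```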
Whoops! Our performance results with the training set were a little too good to be true: the model was evaluated on the same data it learned from, so these metrics are overly optimistic and say little about performance on new data.
Create cross-validation (CV) folds (for 10-fold CV) using vfold_cv() from the rsample package:
set.seed(345)
folds <- vfold_cv(cell_train, v = 10)
folds
## # 10-fold cross-validation
## # A tibble: 10 x 2
## splits id
## <named list> <chr>
## 1 <split [1.4K/152]> Fold01
## 2 <split [1.4K/152]> Fold02
## 3 <split [1.4K/152]> Fold03
## 4 <split [1.4K/152]> Fold04
## 5 <split [1.4K/152]> Fold05
## 6 <split [1.4K/151]> Fold06
## 7 <split [1.4K/151]> Fold07
## 8 <split [1.4K/151]> Fold08
## 9 <split [1.4K/151]> Fold09
## 10 <split [1.4K/151]> Fold10
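Each element of the splits column is an rsplit object; rsample’s analysis() and assessment() extract the two partitions of a fold. A quick sanity check on the first fold (first_fold is a hypothetical name used for illustration):

```r
# Inspect one fold: analysis() returns the ~9/10 of cell_train used
# for fitting, assessment() the held-out ~1/10 used for evaluation
first_fold <- folds$splits[[1]]
dim(analysis(first_fold))
dim(assessment(first_fold))
```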
Do you recall working with workflow() from the second article, Preprocess your data with recipes?
Use a workflow() that bundles together the random forest model and a formula.
rf_wf <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_formula(class ~ .)
Now apply the workflow to fit the model on each fold:
set.seed(456)
rf_fit_rs <-
  rf_wf %>%
  fit_resamples(folds)
rf_fit_rs
## # 10-fold cross-validation
## # A tibble: 10 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [1.4K/152]> Fold01 <tibble [2 × 3]> <tibble [0 × 1]>
## 2 <split [1.4K/152]> Fold02 <tibble [2 × 3]> <tibble [0 × 1]>
## 3 <split [1.4K/152]> Fold03 <tibble [2 × 3]> <tibble [0 × 1]>
## 4 <split [1.4K/152]> Fold04 <tibble [2 × 3]> <tibble [0 × 1]>
## 5 <split [1.4K/152]> Fold05 <tibble [2 × 3]> <tibble [0 × 1]>
## 6 <split [1.4K/151]> Fold06 <tibble [2 × 3]> <tibble [0 × 1]>
## 7 <split [1.4K/151]> Fold07 <tibble [2 × 3]> <tibble [0 × 1]>
## 8 <split [1.4K/151]> Fold08 <tibble [2 × 3]> <tibble [0 × 1]>
## 9 <split [1.4K/151]> Fold09 <tibble [2 × 3]> <tibble [0 × 1]>
## 10 <split [1.4K/151]> Fold10 <tibble [2 × 3]> <tibble [0 × 1]>
Do you see the added columns .metrics and .notes?
Collect and summarize performance metrics from all 10 folds:
collect_metrics(rf_fit_rs)
## # A tibble: 2 x 5
## .metric .estimator mean n std_err
## <chr> <chr> <dbl> <int> <dbl>
## 1 accuracy binary 0.833 10 0.0111
## 2 roc_auc binary 0.903 10 0.00842
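collect_metrics() averages across folds by default; the fold-level values behind those means can be retrieved with summarize = FALSE (assuming a tune version that supports this argument):

```r
# Per-fold metrics before averaging: one row per fold per metric,
# useful for checking fold-to-fold variability
collect_metrics(rf_fit_rs, summarize = FALSE)
```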
Now these are more realistic results!